
Feedback should be sent to goran.milovanovic@datakolektiv.com.
These notebooks accompany the DATA SCIENCE INTENSIVE SERIES :: Introduction to ML in Python DataKolektiv course.

# - libs
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# - set RNG seed
np.random.seed(777)
# - libs
from sklearn import datasets
# - import iris data
iris = pd.read_csv("../_data/iris.csv")
print(iris.head(10))
print(iris.shape)
   sepal_length  sepal_width  petal_length  petal_width   class
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
5           5.4          3.9           1.7          0.4  setosa
6           4.6          3.4           1.4          0.3  setosa
7           5.0          3.4           1.5          0.2  setosa
8           4.4          2.9           1.4          0.2  setosa
9           4.9          3.1           1.5          0.1  setosa
(150, 5)
It is not difficult to obtain the correlation coefficient from two vectors in numpy:
np.corrcoef(iris['sepal_length'], iris['petal_length'])
array([[1. , 0.87175416],
[0.87175416, 1. ]])
Numpy will always return a full correlation matrix, so:
np.corrcoef(iris['sepal_length'], iris['petal_length'])[0,1]
0.8717541573048712
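As a cross-check, pandas offers the same scalar directly via `Series.corr()`, which defaults to Pearson's coefficient. A minimal sketch on synthetic data (the variable names here are illustrative):

```python
import numpy as np
import pandas as pd

# - synthetic stand-in for two DataFrame columns
rng = np.random.default_rng(777)
x = pd.Series(rng.normal(size=100))
y = pd.Series(2.0 * x + rng.normal(scale=0.5, size=100))

# - pandas Series.corr defaults to Pearson's coefficient
r_pandas = x.corr(y)
# - the same value, picked out of numpy's 2x2 correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_pandas, r_numpy)
```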
However, we now want to understand what the correlation coefficient really is. Intuitively, we say that it describes a relationship between two variables. The correlation coefficient - more precisely, Pearson's correlation coefficient in our case - can vary from -1 to +1, describing a negative or positive, strong or weak linear relationship between two variables.
You might have stumbled upon various formulas that compute this correlation coefficient. But it is easy to understand what it really is once we introduce the more elementary concept of covariance between two random variables.
Covariance. Given two random variables (RVs), $X$ and $Y$, their (sample) covariance is given by:
$$cov(X,Y) = E[(X-E[X])(Y-E[Y])] = \frac{\sum_{i=1}^{N}(x_i-\bar{X})(y_i-\bar{Y})}{N-1}$$
# - covariance
v1 = iris['sepal_length']
v2 = iris['petal_length']
# - the expected value of v1 (the mean of v1)
mean_v1 = v1.mean()
# - the expected value of v2 (the mean of v2)
mean_v2 = v2.mean()
diffs = (v1-mean_v1)*(v2-mean_v2)
cov = diffs.sum()/(v1.size-1)
print(cov)
1.273682326621924
Of course we have np.cov() for covariance:
cov = np.cov(v1,v2,bias=False)
print(cov)
[[0.68569351 1.27368233]
 [1.27368233 3.11317942]]
The covariance of a variable with itself is its variance:
np.var(v1,ddof=1)
0.6856935123042507
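This identity holds for any vector; a quick sketch on synthetic data (names illustrative), assuming only numpy:

```python
import numpy as np

# - a synthetic vector
rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=2.0, size=500)

# - the diagonal of the 2x2 covariance matrix holds the variance of each input
cov_xx = np.cov(x, x, ddof=1)[0, 0]
var_x = np.var(x, ddof=1)
print(cov_xx, var_x)
```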
iris.plot.scatter(x="sepal_length",y="petal_length")
<AxesSubplot: xlabel='sepal_length', ylabel='petal_length'>
Enter the z-score: the standardization of random variables.
Pearson's coefficient of correlation is nothing other than the covariance between $X$ and $Y$ after their standardization. The standardization of an RV - widely known as the variable's z-score - is obtained by subtracting the mean from each of its values and dividing by the standard deviation; for the i-th observation of $X$:
$$z(x_i) = \frac{x_i-\bar{X}}{\sigma}$$
# - standard deviation of v1
std_v1 = v1.std(ddof=1)
# - standard deviation of v2
std_v2 = v2.std(ddof=1)
# - v1 expressed as z-score
z_v1 = (v1-mean_v1)/std_v1
# - mean of z_v1 is now zero
print(np.round(z_v1.mean(),2))
# - v2 expressed as z-score
z_v2 = (v2-mean_v2)/std_v2
# - mean of z_v2 is now zero
print(np.round(z_v2.mean(),2))
# - covariance between z_v1 and z_v2
cov = np.cov(z_v1, z_v2, bias=False)
print(cov)
-0.0
0.0
[[1.         0.87175416]
 [0.87175416 1.        ]]
Now, Pearson's correlation between v1 and v2:
np.corrcoef(v1, v2)[0,1]
0.8717541573048712
So: Pearson's correlation coefficient is just the covariance between standardized variables.
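This claim is not specific to iris: for any pair of vectors, the sample covariance (ddof=1) of the z-scores reproduces Pearson's $r$ exactly. A small sketch on synthetic data (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(123)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=1.0, size=200)

# - standardize with the sample standard deviation (ddof=1)
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# - the sample covariance of the z-scores...
cov_z = np.cov(zx, zy, ddof=1)[0, 1]
# - ...equals Pearson's r of the raw variables
r = np.corrcoef(x, y)[0, 1]
print(cov_z, r)
```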
Finally, to determine how much variance is shared between two variables, we square the correlation coefficient to obtain the coefficient of determination, $R^2$:
r2 = np.corrcoef(v1, v2)[0,1]**2
print(r2)
0.759955310778326
We have already introduced some of the building blocks for the first model of statistical learning that we will discuss in the next step: Simple Linear Regression.
In Simple Linear Regression, we discuss the model of the following functional form:
$$Y = \beta_0 + \beta_1X_1 + \epsilon$$
If we assume that the relationship between $X$ and $Y$ is indeed linear - and introduce some additional assumptions that we will discuss in our next session - the following question remains:
What values of $\beta_0$ and $\beta_1$ pick out the line in the plane spanned by the $X$ and $Y$ values that best describes the assumed linear relationship between them?
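Before asking statsmodels, it is worth noting that for simple linear regression this question has a closed-form answer that ties back to the covariance machinery above: $\hat{\beta}_1 = cov(X,Y)/var(X)$ and $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}$. A minimal sketch on synthetic data with a known slope and intercept (names and generating parameters illustrative):

```python
import numpy as np

# - synthetic data with a known linear relationship: y = 2 + 3x + noise
rng = np.random.default_rng(777)
x = rng.uniform(0, 10, size=1000)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=1000)

# - closed-form OLS: the slope is cov(X, Y) / var(X); the intercept follows from the means
beta_1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta_0 = y.mean() - beta_1 * x.mean()
print(beta_0, beta_1)
```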
# - fitting the linear model to the data
import statsmodels.api as sm
import statsmodels.formula.api as smf
linear_model = smf.ols(formula='petal_length ~ sepal_length', data=iris).fit()
linear_model.summary()
| Dep. Variable: | petal_length | R-squared: | 0.760 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.758 |
| Method: | Least Squares | F-statistic: | 468.6 |
| Date: | Tue, 22 Nov 2022 | Prob (F-statistic): | 1.04e-47 |
| Time: | 19:58:31 | Log-Likelihood: | -190.49 |
| No. Observations: | 150 | AIC: | 385.0 |
| Df Residuals: | 148 | BIC: | 391.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -7.0954 | 0.506 | -14.011 | 0.000 | -8.096 | -6.095 |
| sepal_length | 1.8575 | 0.086 | 21.646 | 0.000 | 1.688 | 2.027 |
| Omnibus: | 0.255 | Durbin-Watson: | 1.204 |
|---|---|---|---|
| Prob(Omnibus): | 0.880 | Jarque-Bera (JB): | 0.384 |
| Skew: | -0.084 | Prob(JB): | 0.825 |
| Kurtosis: | 2.817 | Cond. No. | 43.4 |
The parameters, $\beta_0$ (intercept) and $\beta_1$ (slope)
linear_model.params
Intercept -7.095381 sepal_length 1.857510 dtype: float64
The residuals are what the model error term $\epsilon$ represents: the difference between the observed and the predicted values:
iris['Prediction'] = linear_model.predict()
iris['Residuals'] = linear_model.resid
iris.head(10)
| | sepal_length | sepal_width | petal_length | petal_width | class | Prediction | Residuals |
|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 2.377918 | -0.977918 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 2.006416 | -0.606416 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1.634914 | -0.334914 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1.449163 | 0.050837 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 2.192167 | -0.792167 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa | 2.935171 | -1.235171 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa | 1.449163 | -0.049163 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa | 2.192167 | -0.692167 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa | 1.077661 | 0.322339 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa | 2.006416 | -0.506416 |
# - plotting the true data, predicted values and the prediction line
sns.regplot(data=iris, x='sepal_length', y='petal_length', ci=0, line_kws={'color':'red'})
sns.scatterplot(data=iris, x='sepal_length', y='Prediction', color='red', s=50)
sns.despine(offset=10, trim=True)
plt.title('petal_length ~ sepal_length', fontsize=14);
Pearson's $R$ and $R^2$
print("Pearson's correlation (R-value): " + str(round(np.sqrt(linear_model.rsquared), 2)))
print("Coefficient of determination (R^2): " + str(round(linear_model.rsquared, 2)))
Pearson's correlation (R-value): 0.87 Coefficient of determination (R^2): 0.76
OK, statsmodels can do it; but how do we find the optimal values of $\beta_0$ and $\beta_1$ ourselves?
Let's build ourselves a function that takes particular values of $\beta_0$ and $\beta_1$ for a particular regression problem (i.e. for a particular dataset) and returns the model error.
The model error? Oh. Remember the residuals:
$$\epsilon_i = y_i - \hat{y}_i$$
where $y_i$ is the observation to be predicted, and $\hat{y}_i$ the actual prediction?
Next we do something similar to what happens in the computation of variance, square the differences:
$$\epsilon_i^2 = (y_i - \hat{y}_i)^2$$
and define the model error for all observations to be the sum of squares:
$$SSE = \sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$
Obviously, the lower the $SSE$ - the Sum of Squared Errors - the better the model! Here's a function that returns the SSE for a given dataset (with two columns: the predictor and the criterion) and a choice of parameters $\beta_0$ and $\beta_1$:
# - sse function
def lg_sse(pars):
    # - pick up the parameters
    beta_0 = pars[0]
    beta_1 = pars[1]
    # - predict from the parameters
    preds = beta_0 + beta_1*iris['sepal_length']
    # - compute the residuals
    residuals = iris['petal_length'] - preds
    # - square the residuals
    residuals = residuals**2
    # - sum of squares
    residuals = residuals.sum()
    # - out:
    return residuals
Test lg_sse() now:
pars = [-7.095381, 1.857510]
print(lg_sse(pars))
111.34802571010499
Check via statsmodels:
(linear_model.resid**2).sum()
111.3480257092054
Method A. Random parameter space search
beta_0 = np.random.uniform(low=-15, high=15, size=10000)
beta_1 = np.random.uniform(low=-15, high=15, size=10000)
random_pars = pd.DataFrame({'beta_0':beta_0, 'beta_1':beta_1})
random_pars.head()
| | beta_0 | beta_1 |
|---|---|---|
| 0 | -10.420088 | -11.461761 |
| 1 | -5.929302 | 8.116856 |
| 2 | -13.138908 | -1.607650 |
| 3 | -1.204190 | 14.371262 |
| 4 | 10.057602 | -6.014701 |
sse = []
for i in range(random_pars.shape[0]):
    pars = [random_pars['beta_0'][i], random_pars['beta_1'][i]]
    sse.append(lg_sse(pars))
random_pars['sse'] = sse
random_pars.sort_values('sse', ascending=True, inplace=True)
random_pars.head()
| | beta_0 | beta_1 | sse |
|---|---|---|---|
| 584 | -6.673735 | 1.762458 | 114.955420 |
| 1737 | -8.135614 | 2.013342 | 116.350521 |
| 4672 | -7.862939 | 2.015235 | 117.450969 |
| 1190 | -8.103656 | 1.998799 | 118.392934 |
| 2269 | -6.896713 | 1.860619 | 118.401879 |
Check with statsmodels:
linear_model.params
Intercept -7.095381 sepal_length 1.857510 dtype: float64
Not bad; how about 100,000 random pairs?
beta_0 = np.random.uniform(low=-15, high=15, size=100000)
beta_1 = np.random.uniform(low=-15, high=15, size=100000)
random_pars = pd.DataFrame({'beta_0':beta_0, 'beta_1':beta_1})
sse = []
for i in range(random_pars.shape[0]):
    pars = [random_pars['beta_0'][i], random_pars['beta_1'][i]]
    sse.append(lg_sse(pars))
random_pars['sse'] = sse
random_pars.sort_values('sse', ascending=True, inplace=True)
random_pars.head()
| | beta_0 | beta_1 | sse |
|---|---|---|---|
| 51770 | -7.081345 | 1.856986 | 111.366135 |
| 48344 | -7.476603 | 1.915794 | 111.942890 |
| 1599 | -7.304722 | 1.903305 | 112.071409 |
| 2471 | -7.246934 | 1.894646 | 112.131383 |
| 22482 | -6.498430 | 1.746393 | 113.020430 |
Method B. Grid search
beta_0_vals = np.linspace(-15,15,100)
beta_1_vals = np.linspace(-15,15,100)
grid = np.array([(beta_0, beta_1) for beta_0 in beta_0_vals for beta_1 in beta_1_vals])
grid = pd.DataFrame(grid)
grid = grid.rename(columns={0: "beta_0", 1: "beta_1"})
grid.head()
| | beta_0 | beta_1 |
|---|---|---|
| 0 | -15.0 | -15.000000 |
| 1 | -15.0 | -14.696970 |
| 2 | -15.0 | -14.393939 |
| 3 | -15.0 | -14.090909 |
| 4 | -15.0 | -13.787879 |
sse = []
for i in range(grid.shape[0]):
    pars = [grid['beta_0'][i], grid['beta_1'][i]]
    sse.append(lg_sse(pars))
grid['sse'] = sse
grid.sort_values('sse', ascending=True, inplace=True)
grid.head()
| | beta_0 | beta_1 | sse |
|---|---|---|---|
| 2456 | -7.727273 | 1.969697 | 112.717860 |
| 3055 | -5.909091 | 1.666667 | 115.828053 |
| 2955 | -6.212121 | 1.666667 | 123.135675 |
| 2356 | -8.030303 | 1.969697 | 124.341368 |
| 2556 | -7.424242 | 1.969697 | 128.642562 |
Check with statsmodels:
linear_model.params
Intercept -7.095381 sepal_length 1.857510 dtype: float64
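The Python loop over the grid above is slow; the same grid search can be vectorized with numpy broadcasting, computing the SSE for every $(\beta_0, \beta_1)$ pair at once. A sketch on synthetic data shaped like iris (the names and the generating parameters are illustrative):

```python
import numpy as np

# - synthetic data with a known linear relationship
rng = np.random.default_rng(777)
x = rng.uniform(4, 8, size=150)
y = -7.0 + 1.86 * x + rng.normal(scale=0.9, size=150)

beta_0_vals = np.linspace(-15, 15, 100)
beta_1_vals = np.linspace(-15, 15, 100)

# - broadcast to a (100, 100, 150) array of residuals:
#   one slice per (beta_0, beta_1) pair
b0 = beta_0_vals[:, None, None]
b1 = beta_1_vals[None, :, None]
residuals = y[None, None, :] - (b0 + b1 * x[None, None, :])
sse = (residuals ** 2).sum(axis=2)

# - the grid cell with the smallest SSE
i, j = np.unravel_index(sse.argmin(), sse.shape)
print(beta_0_vals[i], beta_1_vals[j], sse[i, j])
```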
Method C. Optimization (the real thing)
The Method of Least Squares
import scipy as sp
# - sse function
def lg_sse(pars, data):
    # - pick up the parameters
    beta_0 = pars[0]
    beta_1 = pars[1]
    # - predict from the parameters
    preds = beta_0 + beta_1*data['sepal_length']
    # - compute the residuals
    residuals = data['petal_length'] - preds
    # - square the residuals
    residuals = residuals**2
    # - sum of squares
    residuals = residuals.sum()
    # - out:
    return residuals
# - initial (random) parameter values
init_beta_0 = np.random.uniform(low=-15, high=15)
init_beta_1 = np.random.uniform(low=-15, high=15)
init_pars = [init_beta_0, init_beta_1]
# - optimize w. Nelder-Mead
optimal_model = sp.optimize.minimize(
    # - fun(parameters, args)
    fun=lg_sse,
    args=(iris,),
    x0=init_pars,
    method='Nelder-Mead')
# - optimal parameters
optimal_model.x
array([-7.09536384, 1.85750582])
Check against statsmodels
linear_model.params
Intercept -7.095381 sepal_length 1.857510 dtype: float64
Final value of the objective function (the model SSE, indeed):
lg_sse(pars=optimal_model.x, data=iris)
111.34802571419112
Check against statsmodels
linear_model.ssr
111.3480257092054
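Iterative optimization is the general-purpose tool, but least squares in particular can also be solved directly with linear algebra; numpy's `np.linalg.lstsq` does exactly that. A sketch on synthetic data (names and generating parameters illustrative):

```python
import numpy as np

# - synthetic data with a known linear relationship
rng = np.random.default_rng(777)
x = rng.uniform(4, 8, size=150)
y = -7.0 + 1.86 * x + rng.normal(scale=0.9, size=150)

# - design matrix: a column of ones (intercept) next to the predictor
X = np.column_stack([np.ones_like(x), x])
coef, res, rank, _ = np.linalg.lstsq(X, y, rcond=None)
beta_0, beta_1 = coef
print(beta_0, beta_1)
```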
Error Surface Plot: The Objective Function
beta_0_vals = np.linspace(-15,15,100)
beta_1_vals = np.linspace(-15,15,100)
grid = np.array([(beta_0, beta_1) for beta_0 in beta_0_vals for beta_1 in beta_1_vals])
grid = pd.DataFrame(grid)
grid = grid.rename(columns={0: "beta_0", 1: "beta_1"})
sse = []
for i in range(grid.shape[0]):
    pars = [grid['beta_0'][i], grid['beta_1'][i]]
    sse.append(lg_sse(pars, iris))
grid['sse'] = sse
grid.sort_values('sse', ascending=True, inplace=True)
grid.head()
| | beta_0 | beta_1 | sse |
|---|---|---|---|
| 2456 | -7.727273 | 1.969697 | 112.717860 |
| 3055 | -5.909091 | 1.666667 | 115.828053 |
| 2955 | -6.212121 | 1.666667 | 123.135675 |
| 2356 | -8.030303 | 1.969697 | 124.341368 |
| 2556 | -7.424242 | 1.969697 | 128.642562 |
# - import plotly
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"
# - Mesh3d: Objective Function
fig = go.Figure(data=[go.Mesh3d(
x=grid['beta_0'],
y=grid['beta_1'],
z=grid['sse'],
color='lightblue',
opacity=0.50)])
fig.update_layout(scene = dict(
xaxis_title='Beta_0',
yaxis_title='Beta_1',
zaxis_title='SSE'),
width=700,
margin=dict(r=20, b=10, l=10, t=10))
fig.show()
Maximum Likelihood Estimation
# - normal distribution
from scipy.stats import norm
# - nll function
def lg_nll(pars, data):
    # - pick up the parameters
    beta_0 = pars[0]
    beta_1 = pars[1]
    sigma = np.abs(pars[2])
    # - predict from the parameters
    preds = beta_0 + beta_1*data['sepal_length']
    # - compute the residuals
    residuals = data['petal_length'] - preds
    # - negative log-likelihood
    nll = -np.sum(norm.logpdf(residuals, loc=0, scale=sigma))
    # - out:
    return nll
# - initial (random) parameter values
init_beta_0 = np.random.uniform(low=-15, high=15)
init_beta_1 = np.random.uniform(low=-15, high=15)
init_sigma = np.random.uniform(low=.1, high=10)
init_pars = [init_beta_0, init_beta_1, init_sigma]
# - optimize w. Nelder-Mead
optimal_model = sp.optimize.minimize(
    # - fun(parameters, args)
    fun=lg_nll,
    args=(iris,),
    x0=init_pars,
    method='Nelder-Mead',
    options={'maxiter': 10e6,
             'maxfev': 10e6,
             'fatol': 10e-12})
# - optimal parameters
optimal_model.x
array([-7.09538126, 1.85750961, 0.86157992])
Check against statsmodels
linear_model.params
Intercept -7.095381 sepal_length 1.857510 dtype: float64
Log Likelihood
loglike = -lg_nll(optimal_model.x, data=iris)
print(loglike)
-190.49268265220655
Check against statsmodels
linear_model.llf
-190.49268265220212
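The fitted $\sigma \approx 0.8616$ is not an accident: under the normal-error model, the MLE of the variance is $\hat{\sigma}^2 = SSE/N$ (dividing by $N$, not $N-2$), and the maximized log-likelihood has the closed form $-\frac{N}{2}\left(\ln 2\pi + \ln\frac{SSE}{N} + 1\right)$. A quick check against the values reported above:

```python
import numpy as np

# - values reported by statsmodels above
sse = 111.3480257092054
n = 150

# - MLE of sigma: sqrt(SSE / N), dividing by N rather than N - 2
sigma_mle = np.sqrt(sse / n)
# - maximized log-likelihood of the normal-error model, in closed form
llf = -n / 2 * (np.log(2 * np.pi) + np.log(sse / n) + 1)
print(sigma_mle, llf)
```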
The Likelihood Function
beta_0_vals = np.linspace(-10,0,300)
beta_1_vals = np.linspace(0,3,300)
sigma = optimal_model.x[2]
grid = np.array([(beta_0, beta_1) for beta_0 in beta_0_vals for beta_1 in beta_1_vals])
grid = pd.DataFrame(grid)
grid = grid.rename(columns={0: "beta_0", 1: "beta_1"})
nll = []
for i in range(grid.shape[0]):
    pars = [grid['beta_0'][i], grid['beta_1'][i], sigma]
    nll.append(lg_nll(pars, iris))
grid['likelihood'] = nll
grid['likelihood'] = np.exp(-grid['likelihood'])
grid.sort_values('likelihood', ascending=False, inplace=True)
grid.head()
| | beta_0 | beta_1 | likelihood |
|---|---|---|---|
| 26285 | -7.090301 | 1.856187 | 1.860885e-83 |
| 26884 | -7.023411 | 1.846154 | 1.840102e-83 |
| 25686 | -7.157191 | 1.866221 | 1.830589e-83 |
| 25387 | -7.190635 | 1.876254 | 1.780881e-83 |
| 27483 | -6.956522 | 1.836120 | 1.769939e-83 |
# - Mesh3d: Objective Function
fig = go.Figure(data=[go.Mesh3d(
x=grid['beta_0'],
y=grid['beta_1'],
z=grid['likelihood'],
color='red',
opacity=0.50)])
fig.update_layout(scene = dict(
xaxis_title='Beta_0',
yaxis_title='Beta_1',
zaxis_title='Likelihood'),
width=700,
margin=dict(r=20, b=10, l=10, t=10))
fig.show()
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.